This R notebook introduces anyone interested in data science to communicating data and statistical findings effectively through suitable visualisations. Along the way, we analyse the evolution of Tesla’s stock price alongside Elon Musk’s tweets. We use the ggplot2 and plotly packages, which produce high-quality, publication-ready visualisations for static as well as dynamic and interactive applications. Both packages are built around the so-called Grammar of Graphics, a framework that describes how the elements, or layers, of a plot should be separated and classified so that visualisations can be constructed in a structured way. For more information, see Hadley Wickham (2010) - A Layered Grammar of Graphics and Wilkinson (2011) - The Grammar of Graphics.

1 Package Settings

# Global chunk settings
# Load packages

# TODO: Create automatic package installation for students

library(conflicted)
library(gapminder)
library(httr)
library(rtweet)
library(quantmod)
library(Quandl)
library(pins)
library(tidyverse)
library(lubridate)
library(tsbox)
library(DT)
library(ggrepel)
library(plotly)
library(wordcloud2)
library(viridis)
library(viridisLite)
library(RColorBrewer)
library(bookdown)

# Conflicted: hierarchy in case of conflict

conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("first", "dplyr")
conflict_prefer("last", "dplyr")
conflict_prefer("lag", "dplyr")
conflict_prefer("layout", "plotly")

2 Plotting Settings

# Color settings

# viridis_pal(n = 10)

palette(viridis(n = 10))

# palette(brewer.pal(n = 11, name = "RdYlGn"))

col_palette_blue <- brewer.pal(n = 9, name = "PuBu")

3 Data Input

# Some options for quantmod package

options("getSymbols.warning4.0" = F)

To start with, we get Tesla stock data (ticker = “TSLA”) from Yahoo Finance by using the quantmod package. All that is required to download the data is the ticker of the corresponding financial instrument.

getSymbols(Symbols = "TSLA",
           src     = "yahoo",
           verbose = F)
## [1] "TSLA"

Second, we also get S&P 500 index (SPY ETF) data (ticker = “SPY”) from Yahoo Finance.

getSymbols(Symbols = "SPY",
           src     = "yahoo",
           verbose = F)
## [1] "SPY"

Next, we do some data wrangling to transform the Tesla stock data into a tibble with the dplyr and tsbox packages and rename its columns. Tibbles are enhanced data.frames around which the tidyverse packages (and a great many other packages) are built. They provide a standardised way of storing data that comes in diverse formats. I also use the pipe operator %>% to make the workflow and the required steps easy to grasp and to adjust later on (see picture below for a short explanation).

[Figure: the pipe operator]
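
To illustrate the pipe explanation above, here is a minimal sketch. It uses the magrittr package, which provides %>%; the toy vector is purely illustrative and not part of the notebook's data.

```r
library(magrittr)  # provides the %>% operator

x <- c(1, 4, 9)  # toy data, not from the notebook

# Nested call: has to be read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right, each result feeds the next function
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```

Both versions compute the same value; the piped form simply lists the steps in the order in which they are applied.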

df_Tesla_stock_data <- TSLA %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date     = time,
           Open     = TSLA.Open,
           High     = TSLA.High,
           Low      = TSLA.Low,
           Close    = TSLA.Close,
           Volume   = TSLA.Volume,
           Adjusted = TSLA.Adjusted)

The Tesla stock data now looks like this: each row is one trading day, and the seven columns hold the variables (also called features in the ML context). For each of the 2’560 daily observations, we have the date in the Date column, the Open price at the start of trading on the exchange, the daily High and Low prices, the Close price at the end of trading, the trading Volume, and finally an Adjusted price, which accounts for stock splits, dividends, and similar corporate actions.

datatable(df_Tesla_stock_data)

We do the same for the S&P 500 (SPY ETF) index data.

df_SPY_data <- SPY %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date     = time,
           Open     = SPY.Open,
           High     = SPY.High,
           Low      = SPY.Low,
           Close    = SPY.Close,
           Volume   = SPY.Volume,
           Adjusted = SPY.Adjusted)

The S&P 500 (SPY ETF) series has more observations than the Tesla series, namely data points on 3’438 days. Otherwise, it is in the same format. Here is what it looks like:

datatable(df_SPY_data)

Finally, we join both stock price time series so that they are available in a single tibble.

df_Tesla_SPY <- df_SPY_data %>%
    full_join(df_Tesla_stock_data,
              by     = "Date",
              suffix = c("SPY", "TSLA"))
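
To see what the suffix argument does, here is a minimal sketch on two toy tibbles (the values are made up; only the column naming and the NA behaviour mirror the join above). Columns that share a name in both tables get the suffix appended, and dates present in only one table are filled with NA, which is presumably where the missing-value warnings in later plots come from.

```r
library(dplyr)

# Toy data: the first series starts one day earlier than the second
spy  <- tibble(Date = as.Date("2020-01-01") + 0:2, Adjusted = c(100, 101, 102))
tsla <- tibble(Date = as.Date("2020-01-02") + 0:1, Adjusted = c(50, 55))

# full_join keeps all dates from both tables; clashing column names get a suffix
joined <- full_join(spy, tsla, by = "Date", suffix = c("SPY", "TSLA"))

names(joined)
sum(is.na(joined$AdjustedTSLA))  # dates with no matching row in the second table
```

A full join is a reasonable choice here because no trading day of either series is dropped; an inner join would instead keep only the dates common to both.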

In addition, we now compute the (continuous) stock returns for both financial instruments.

df_Tesla_SPY <- df_Tesla_SPY %>%
    mutate(ReturnsSPY  = log(AdjustedSPY) - lag(log(AdjustedSPY)),
           ReturnsTSLA = log(AdjustedTSLA) - lag(log(AdjustedTSLA)))
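
As a small worked example of continuous (log) returns, the sketch below uses base R and three made-up prices. It computes the same quantity as log(x) - lag(log(x)) above, just without the dplyr dependency.

```r
prices <- c(100, 110, 99)  # toy adjusted prices

# Continuous (log) return: r_t = log(P_t) - log(P_{t-1});
# the first observation has no predecessor, hence NA
log_returns <- c(NA, diff(log(prices)))

# Log returns are additive: their sum equals the log of the total price ratio
total_log_return <- sum(log_returns, na.rm = TRUE)
all.equal(total_log_return, log(99 / 100))
```

This additivity over time is one common reason for preferring continuous returns over simple percentage changes.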

Next, we scrape tweet data from Elon Musk’s and Tesla’s official Twitter accounts with the rtweet package. Unfortunately, only the most recent 3’212 tweets per user are available, because Twitter limits access to historical data in order to offer it commercially instead. Tweet scraping requires a Twitter account and a developer registration for the free Twitter API. This is fairly easy to set up, however, and should only take a couple of minutes.

df_tweets_elon_musk <- get_timeline("elonmusk", n = 5000)

df_tweets_tesla     <- get_timeline("Tesla", n = 5000)

The tweets dataset is rather wide, with 90 columns. Thus, only a subset of the columns is shown here to give an idea of what the data set for Elon Musk’s tweets looks like:

df_tweets_elon_musk %>%
    select(user_id, created_at, screen_name, text, source, is_quote,
           is_retweet, favorite_count, retweet_count, hashtags) %>%
    datatable(filter  = "top",
              options = list(pageLength = 5,
                             autoWidth  = T))

…and Tesla’s official Twitter account:

df_tweets_tesla %>%
    select(user_id, created_at, screen_name, text, source, is_quote,
           is_retweet, favorite_count, retweet_count, hashtags) %>%
    datatable(filter = "top",
              options = list(pageLength = 5,
                             autoWidth  = T))

4 Our First Plot - Time Series of Tesla’s Stock Price

Now we’re ready to take the Tesla stock price data and create a basic ggplot2 time series chart. We use the above-mentioned Grammar of Graphics to set up each layer of the plot. First, we map the data to so-called aesthetics. Aesthetics are defined within the aes() function in ggplot2 and specify, for example, what goes on the x-axis and y-axis, what is shown in which colour, and how the size of an object in the plot is determined. For our basic time series plot, we simply map the Date column from the stock data to the x-axis and the Adjusted stock price to the y-axis. The only additional layer needed for a finished plot is a so-called geom (short for geometric object). Geoms determine the kind of plot we want to display and are added with the family of geom_... functions. Here, we’d like to create a simple line plot with geom_line(). We add the new layer to the plot with the + operator, set the line geom, and save the plot to a new R object. With that, we have our first plot.

p_basic_time_series_Tesla <- ggplot(data = df_Tesla_stock_data,
                                    aes(x = Date, y = Adjusted)) +  # Close
    geom_line()

p_basic_time_series_Tesla

So far, so good. However, the plot doesn’t look particularly great, does it? The grey background is rather irritating, the date on the x-axis is only displayed every five years, it’s unclear in what units the y-axis is shown, and in general, there’s no title or anything to really indicate what is exactly shown here. The only information we have is the evolution of the series over a time period of 10 years and its corresponding values on the y-axis. We need to adjust some basic layers of the plot.

For a visual overview and explanations of the different layers in ggplot2’s Grammar of Graphics, see this Towards Data Science article:

ggplot2 Grammar of Graphics

We start by adjusting the scales of the x- and y-axes in a new layer, the scales layer. We reuse the plot object from above and add the scale_x_... and scale_y_... functions with suitable arguments.

p_basic_time_series_Tesla_w_scales <- p_basic_time_series_Tesla +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar,
                       breaks = seq(from = 0, to = 1750, by = 250))

p_basic_time_series_Tesla_w_scales

The theme of a plot is yet another layer in the “Grammar of Graphics”. Setting a beautiful theme will help us to get rid of the irritating grey background. Let’s try the theme_classic() function.

p_basic_time_series_Tesla_w_scales_and_theme <- p_basic_time_series_Tesla_w_scales +
    theme_classic()

p_basic_time_series_Tesla_w_scales_and_theme

theme_classic() is a clean and simple theme. For the purpose of interpreting a time series plot, however, a theme that includes a grid is more appropriate. Thus, in the following plots we use theme_light() instead. We would also like to add a proper title. Main titles and subtitles as well as axis labels are set with the labs() function. In addition, we accentuate the x- and y-axes by drawing them thicker than the background grid lines. Let’s also adjust the label of the y-axis to make it clearer what it represents. Finally, let’s add a caption with the copyright for the plot. Now we have our first complete time series plot.

p_basic_time_series_Tesla_w_scales +
    theme_light() +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75)) +
    labs(title    = "Tesla Stock Price",
         subtitle = "Rising Higher and Higher...",
         y        = "Close (Adjusted)",
         caption  = "© Data Science & Technology Club HSG")

For the following plots, let’s set a global default ggplot2 theme, instead of adding it manually to each plot.

theme_set(theme_light())

To improve further on our plot, we can add a so-called benchmark to it. A benchmark is, e.g., another time series to compare the Tesla stock price with. We use the previously gathered S&P 500 prices to do exactly that. In order to be able to compare the prices of the two series and to get them into the same y-axis limits, some data wrangling and rebasing is required.

df_Tesla_SPY <- df_Tesla_SPY %>%
    mutate(AdjustedTSLARebased = AdjustedTSLA / first(df_Tesla_stock_data$Adjusted),
           AdjustedSPYRebased  = AdjustedSPY / first(df_SPY_data$Adjusted))

p_time_series_Tesla_vs_SPY <- df_Tesla_SPY %>%
    ggplot(aes(x = Date)) +
    geom_line(aes(y = AdjustedTSLARebased), col = palette()[4]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedTSLARebased)),
               col = palette()[4],
               size = 2) +
    geom_text(label = "TSLA",
              aes(x = last(Date),
                  y = last(AdjustedTSLARebased)),
              color = palette()[4],
              hjust = 1.2,
              vjust = -1) +
    geom_line(aes(y = AdjustedSPYRebased), col = palette()[1]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedSPYRebased)),
               col = palette()[1],
               size = 2) +
    geom_text(label = "S&P 500",
              aes(x = last(Date),
                  y = last(AdjustedSPYRebased)),
              color = palette()[1],
              hjust = 1.2,
              vjust = -1) +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::percent,
                       breaks = seq(from = 0, to = 100, by = 10)) +
    labs(title    = "Tesla's Stock Price vs. S&P 500 Benchmark",
         subtitle = "Rising Higher and Higher...",
         y        = "Price Rebased (%)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_time_series_Tesla_vs_SPY
## Warning: Removed 878 row(s) containing missing values (geom_path).
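
The rebasing step used above can be checked on a toy series. This is a minimal base R sketch with made-up prices: dividing by the first observation makes the series start at 1, which a percent scale then displays as 100%.

```r
prices <- c(50, 55, 75, 100)  # toy adjusted prices

# Divide by the first observation so the series starts at 1 (i.e. 100%)
rebased <- prices / prices[1]

rebased
```

Because both series are expressed relative to their own starting value, they can share one y-axis even though their price levels differ.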

It is impressive how much Tesla’s stock price outperforms the (already well-performing) S&P 500. Beginning in mid-October 2019 in particular, the volatility of the stock increases immensely: a sharp rise is followed by a sharp decline and then another sharp rise. It remains questionable whether Tesla’s recent stock price appreciation is sustainable and warranted in the long run. Let’s highlight the period during which Tesla’s stock price increase was most notable in the chart. We can do this with the annotate() function. Highlighting areas or specific parts of a chart is a useful element of storytelling with data.

p_time_series_Tesla_vs_SPY +
    annotate(geom  = "rect",
             xmin  = as.Date("2019-10-15"),
             xmax  = last(df_Tesla_SPY$Date) + 15,
             ymin  = 0,
             ymax  = max(df_Tesla_SPY$AdjustedTSLARebased, na.rm = T),
             col   = "grey",
             alpha = 0.25) +
    annotate(geom  = "text",
             label = "High Volatility Period",
             x     = as.Date("2020-04-01"),
             y     = -2)
## Warning: Removed 878 row(s) containing missing values (geom_path).

5 Our Second Plot - Scatter Plot

Next, we turn to one of the most basic, but also most useful plots - the scatter plot. First, however, we compute the mean returns for both financial instruments.

df_Tesla_SPY_avg <- df_Tesla_SPY %>%
    summarise(SPY_mean  = mean(ReturnsSPY, na.rm = T),
              TSLA_mean = mean(ReturnsTSLA, na.rm = T))

We use geom_jitter() instead of geom_point() because it slightly and randomly displaces individual observations to avoid overplotting, which makes the individual points easier to see. Returns of the SPY go on the x-axis and returns of Tesla on the y-axis. We also highlight the most recent return in the data (labelled “Today” in the legend) to see where it stands in comparison to historical returns. The if_else() function is pretty handy for this purpose.

p_scatter_Tesla_SPY <- df_Tesla_SPY %>%
    ggplot(aes(x = ReturnsSPY, y = ReturnsTSLA)) +
    geom_jitter(aes(col = if_else(Date == max(Date, na.rm = T), "Today", "Historical")),
                alpha = 0.5) +  # geom_point()
    # geom_vline(xintercept = df_Tesla_SPY_avg$SPY_mean) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_manual(name   = "Date",
                       values = c(col_palette_blue[6], "red")) +
    labs(title    = "Scatter Plot",
         subtitle = "SPY vs. TSLA Returns",
         x        = "SPY Returns (Continuous)",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_scatter_Tesla_SPY
## Warning: Removed 879 rows containing missing values (geom_point).
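
The if_else() flag used for the colour aesthetic above can be tried in isolation. The following is a minimal sketch on toy data (dates and returns are made up); it only shows how the most recent observation gets its own label.

```r
library(dplyr)

returns <- tibble(
    Date    = as.Date("2020-07-01") + 0:3,  # toy dates
    Returns = c(0.01, -0.02, 0.005, 0.03)   # toy returns
)

# Flag the row with the latest date, label everything else as historical
flagged <- returns %>%
    mutate(Highlight = if_else(Date == max(Date, na.rm = TRUE),
                               "Today", "Historical"))

# On a discrete colour scale the labels are ordered alphabetically, so unnamed
# manual colours are matched as "Historical" first and "Today" second
table(flagged$Highlight)
```

Keeping that ordering in mind helps when choosing the order of the values passed to scale_color_manual().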

Scatter plots are great for analysing the relationship between two (continuous) variables and are probably the most used charts in research and ML contexts. To check whether a linear relationship between the returns of the SPY and Tesla exists, we can add a regression line with geom_smooth(). The method argument is set to lm for linear model.

p_scatter_Tesla_SPY +
    geom_smooth(method = "lm",
                col    = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 879 rows containing non-finite values (stat_smooth).
## Warning: Removed 879 rows containing missing values (geom_point).

By looking at the scatter plot and the dispersion of points, however, it is doubtful whether the relationship is truly linear. Thus, we can try to set another model, such as loess (local polynomial regression fitting), in geom_smooth().

p_scatter_Tesla_SPY +
    geom_smooth(method = "loess",
                col    = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 879 rows containing non-finite values (stat_smooth).
## Warning: Removed 879 rows containing missing values (geom_point).

From these plots alone, it remains unclear what the true relationship between returns of the SPY and Tesla is. All we can say is that Tesla on average seems to perform better when the U.S. stock market also performs well. However, the more extreme the returns are, the more uncertainty there is about the relationship, as indicated by the wider confidence intervals. This is because we have comparatively few observations of extreme returns.

6 Our Third Plot - Bar Chart of Tesla’s Stock Volume

Next, we create a bar plot of Tesla’s daily trading volume.

p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Volume)) +
    geom_col() +
    labs(title   = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time",
         caption = "© Data Science & Technology Club HSG") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::comma,  # volume is a number of shares, not a dollar amount
                       breaks = seq(from = 0, to = 70e6, by = 10e6))

p_bar_Tesla_stock_volume

We can play with the width argument in geom_col() to adjust the width of the plotted bars.

p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Volume)) +
    geom_col(width = 0.2) +
    labs(title   = "Tesla Trading Volume - ...While Trading Volume Remained Constant Over Time",
         caption = "© Data Science & Technology Club HSG") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::comma,  # volume is a number of shares, not a dollar amount
                       breaks = seq(from = 0, to = 70e6, by = 10e6))

p_bar_Tesla_stock_volume

7 Histogram - Tesla Stock Returns

First, we create a histogram to visualise the distribution of Tesla’s stock returns over time.

p_hist_Tesla <- df_Tesla_SPY %>%
    ggplot(aes(x = ReturnsTSLA)) +
    geom_histogram(bins  = 500,
                   col   = col_palette_blue[6],
                   alpha = 0.5) +
    labs(title    = "Histogram",
         subtitle = "Tesla Stock Returns",
         x        = "Continuous Returns",
         y        = "Count") +
    scale_x_continuous(label = scales::percent) +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_hist_Tesla
## Warning: Removed 879 rows containing non-finite values (stat_bin).

Then, we overlay a kernel density estimate on the histogram.

p_hist_Tesla <- p_hist_Tesla +
    geom_density(kernel = "gaussian",
                 col    = "red")

p_hist_Tesla
## Warning: Removed 879 rows containing non-finite values (stat_bin).
## Warning: Removed 879 rows containing non-finite values (stat_density).

Next, we add the mean and median return as vertical lines.

Tesla_returns_mean   <- mean(df_Tesla_SPY$ReturnsTSLA, na.rm = T)

Tesla_returns_median <- median(df_Tesla_SPY$ReturnsTSLA, na.rm = T)

p_hist_Tesla +
    geom_vline(xintercept = Tesla_returns_mean,
               col        = palette()[1]) +
    geom_vline(xintercept = Tesla_returns_median,
               col        = palette()[8])
## Warning: Removed 879 rows containing non-finite values (stat_bin).
## Warning: Removed 879 rows containing non-finite values (stat_density).

# Determine y-axis density position of median, mean, and confidence intervals

p_hist_Tesla <- df_Tesla_SPY %>%
    ggplot(aes(x = ReturnsTSLA)) +
    stat_density(aes(y = ..scaled..),
                 geom   = "line",
                 size   = 0.5,
                 col    = col_palette_blue[6],
                 adjust = 1) +
    labs(title = "Density - Tesla Stock Returns",
         x     = "Continuous Returns",
         y     = "Scaled Density") +
    scale_x_continuous(label = scales::percent) +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

# Standard error of the mean: count only the non-missing returns
mean_se <- sd(df_Tesla_SPY$ReturnsTSLA, na.rm = T) / sqrt(sum(!is.na(df_Tesla_SPY$ReturnsTSLA)))

mean_conf_inter_l <- Tesla_returns_mean - 1.96 * mean_se

mean_conf_inter_u <- Tesla_returns_mean + 1.96 * mean_se

mean_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
    slice(which.min(abs(x - Tesla_returns_mean))) %>%
    pull(ndensity)
## Warning: Removed 879 rows containing non-finite values (stat_density).
mean_conf_inter_l_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
    slice(which.min(abs(x - mean_conf_inter_l))) %>%
    pull(ndensity)
## Warning: Removed 879 rows containing non-finite values (stat_density).
mean_conf_inter_u_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
    slice(which.min(abs(x - mean_conf_inter_u))) %>%
    pull(ndensity)
## Warning: Removed 879 rows containing non-finite values (stat_density).
p_hist_Tesla +
    geom_segment(x = Tesla_returns_mean,
                 xend = Tesla_returns_mean,
                 y = 0,
                 yend = mean_pos_y,
                 linetype = "solid",
                 color = col_palette_blue[6],
                 size = 0.4) +
    geom_point(x = Tesla_returns_mean,
               y = mean_pos_y,
               col = col_palette_blue[6])
## Warning: Removed 879 rows containing non-finite values (stat_density).

    # geom_area(x = mean_conf_inter_l,
    #              xend = mean_conf_inter_u,
    #              y = mean_conf_inter_l_pos_y,
    #              yend = mean_conf_inter_u_pos_y,
    #              linetype = "solid",
    #              color = "grey",
    #              size = 0.4)
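
The standard error of the mean divides the standard deviation by the square root of the number of observations; when a series contains missing values, that count should include only the non-missing ones. The sketch below shows the calculation and an approximate 95% confidence interval on toy returns (values are made up, and the 1.96 multiplier assumes a normal approximation).

```r
returns <- c(0.01, -0.02, 0.03, NA, 0.00)  # toy returns with one missing value

n_obs       <- sum(!is.na(returns))         # number of non-missing observations
returns_avg <- mean(returns, na.rm = TRUE)
std_error   <- sd(returns, na.rm = TRUE) / sqrt(n_obs)

# Approximate 95% confidence interval based on the normal quantile 1.96
conf_int <- c(lower = returns_avg - 1.96 * std_error,
              upper = returns_avg + 1.96 * std_error)
conf_int
```

Using length(returns) instead of n_obs would count the NA as well and make the interval slightly too narrow.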

8 Faceted Time Series Plot

If we want to display multiple series in a single plot, this is best done by using the ggplot2 facets layer. It is applied as a separate layer in our already existing time series plot. First, however, some data wrangling is required to transform the data from wide to long format.

df_Tesla_stock_data_long <- df_Tesla_stock_data %>%
    select(-Volume) %>%
    pivot_longer(cols      = -Date,
                 names_to  = "Variable",
                 values_to = "Values")

p_time_series_Tesla_faceted <- df_Tesla_stock_data_long %>%
    ggplot(aes(x = Date, y = Values, col = Variable)) +
    geom_line() +
    facet_wrap(. ~ Variable) +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar,
                       breaks = seq(from = 0, to = 1750, by = 250)) +
    scale_color_viridis(discrete = T) +
    labs(title   = "Faceted Stock Price Time Series - Tesla",
         y       = "Stock Price",
         caption = "© Data Science & Technology Club HSG")

p_time_series_Tesla_faceted
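
The wide-to-long transformation that the faceted plot relies on can be seen on a toy table. This is a minimal sketch with made-up prices; the column names Variable and Values simply follow the ones used above.

```r
library(tidyr)
library(tibble)

wide <- tibble(
    Date  = as.Date("2020-07-01") + 0:1,  # toy dates
    Open  = c(10, 11),
    Close = c(10.5, 10.8)
)

# Every column except Date is stacked: its name goes to Variable, its value to Values
long <- pivot_longer(wide,
                     cols      = -Date,
                     names_to  = "Variable",
                     values_to = "Values")

long
```

Two dates with two price columns become four rows, one per date and variable, which is the shape facet_wrap() needs to split the data by Variable.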

9 Interactive Plots with Plotly

To add some more spice to the previously built plots, we can turn them into interactive web graphs. This is where the plotly package comes into play. It is built on the plotly.js JavaScript library and is extremely useful and versatile for interactive plots in reports, dashboards or web pages.

p_time_series_Tesla_faceted %>%
    ggplotly()
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# FIXME: Annotation doesn't work yet

p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY %>%
    ggplotly() %>%
    layout(annotations = list(x = 1, y = 1,
                              text = "© Data Science & Technology Club HSG"))

p_time_series_Tesla_vs_SPY

10 Elon Musk’s Tweets

We start by computing the number of tweets Elon Musk writes per day and show summary statistics of those.

df_tweets_elon_musk_per_day <- df_tweets_elon_musk %>%
    mutate(Date = as.Date(created_at)) %>%
    group_by(Date) %>%
    summarise(TweetsN = n())
## `summarise()` ungrouping output (override with `.groups` argument)
df_tweets_elon_musk_per_day %>%
    summarise(Min            = min(TweetsN, na.rm = T),
              `1st Quartile` = quantile(TweetsN, probs = 0.25),
              Median         = median(TweetsN, na.rm = T),
              Mean           = round(mean(TweetsN, na.rm = T), digits = 2),
              `3rd Quartile` = quantile(TweetsN, probs = 0.75),
              Max            = max(TweetsN, na.rm = T)) %>%
    datatable(caption = htmltools::tags$caption(style = "caption-side: bottom; text-align: center;",
                                                "Table 1: ",
                                                htmltools::em("Summary statistics of daily tweets by Elon Musk.")))
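
The per-day aggregation can be tried on a few toy timestamps. This is a minimal sketch with made-up times, using dplyr's count() as a shorthand for group_by() followed by summarise(n()); the time zone is set explicitly to UTC here as an assumption to keep the date conversion unambiguous.

```r
library(dplyr)

tweets <- tibble(
    created_at = as.POSIXct(c("2020-05-01 08:00:00",
                              "2020-05-01 21:30:00",
                              "2020-05-03 12:00:00"),
                            tz = "UTC")  # toy timestamps
)

per_day <- tweets %>%
    mutate(Date = as.Date(created_at, tz = "UTC")) %>%
    count(Date, name = "TweetsN")  # one row per date that has at least one tweet

per_day
```

Note that days without any tweets do not appear as rows with a count of zero; they are simply absent, which affects summary statistics such as the minimum and the mean.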

Next, we create a bar plot to visualise the number of tweets per day.

p_bar_tweets_elon_musk <- df_tweets_elon_musk_per_day %>%
    ggplot(aes(x = Date, y = TweetsN, fill = TweetsN)) +
    geom_col() +
    scale_x_date(date_breaks = "1 month",
                 date_labels = "%Y %b") +
    scale_y_continuous(breaks = seq(0, 60, 10)) +
    labs(title = "Tweets by Elon Musk",
         x     = "Month",
         y     = "Number of Tweets") +
    scale_fill_binned(type = "viridis") +
    theme(axis.text.x = element_text(angle = 60,
                                     hjust = 1))

p_bar_tweets_elon_musk <- p_bar_tweets_elon_musk %>%
    ggplotly()

p_bar_tweets_elon_musk

We can compare this to the evolution of Tesla’s stock price.

subplot(p_time_series_Tesla_vs_SPY,
        p_bar_tweets_elon_musk,
        nrows  = 2,
        shareX = T)

Let’s see whether the number of tweets by Elon Musk per day is associated in any way with the returns of Tesla’s stock. We naively try to do this with a scatter plot first.

df_Tesla_EM_tweets <- df_Tesla_SPY %>%
    full_join(df_tweets_elon_musk_per_day,
              by = "Date") %>%
    select(Date, ReturnsTSLA, TweetsN)

p_scatter_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
    ggplot(aes(x = TweetsN, y = ReturnsTSLA)) +
    geom_jitter(col   = col_palette_blue[6],
                alpha = 0.5) +
    scale_x_continuous(breaks = seq(0, max(df_tweets_elon_musk_per_day$TweetsN, na.rm = T), 10)) +
    scale_y_continuous(labels = scales::percent) +
    labs(title    = "Number of Daily Tweets vs. TSLA Returns",
         subtitle = "Scatter Plot",
         x        = "Number of Tweets per Day",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_scatter_Tesla_EM_tweets %>%
    ggplotly()

Just from looking at the scatter plot, it’s hard to tell. Hence, we produce a boxplot with the same underlying data as before. To do this, we need to sort the number of tweets into so-called “bins”. We choose 12 bins, which splits the number of tweets per day into intervals approximately 5 tweets wide.

p_boxplot_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
    mutate(TweetsN = cut(TweetsN, breaks = 12)) %>%
    filter_all(~ !is.na(.)) %>%
    ggplot(aes(x = TweetsN, y = ReturnsTSLA, col = TweetsN)) +
    geom_boxplot() +
    scale_y_continuous(labels = scales::percent) +
    scale_color_viridis_d() +
    labs(title    = "Number of Daily Tweets vs. TSLA Returns",
         subtitle = "Boxplot",
         x        = "Number of Tweets per Day",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_boxplot_Tesla_EM_tweets %>%
    ggplotly()
# Get Tesla tweets

df_tweets_elon_musk_Tesla <- df_tweets_elon_musk %>%
    filter(str_detect(text, pattern = "Tesla"))
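
The binning used for the boxplot above relies on cut() with a number of breaks. The sketch below shows on toy counts (made up, spanning 0 to 59) how breaks = 12 divides the observed range into 12 equal-width intervals.

```r
tweets_per_day <- c(0, 3, 7, 12, 30, 59)  # toy daily tweet counts

# breaks = 12 splits the observed range into 12 equal-width intervals
bins <- cut(tweets_per_day, breaks = 12)

nlevels(bins)     # number of bins
as.integer(bins)  # index of the bin each observation falls into
levels(bins)[1]   # interval label of the first bin
```

With a range of 59, each interval is a little under 5 wide. Bins that contain no observations still exist as factor levels, but they will not produce a box in the plot.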

11 Wordcloud Plot - Elon Musk’s Tweets

# TODO

# wordcloud2()

12 Candlestick Chart - Tesla’s Stock Price

Candlestick chart

df_Tesla_stock_data %>%
    plot_ly(x     = ~ Date,
            type  = "candlestick",
            open  = ~ Open,
            close = ~ Close,
            high  = ~ High,
            low   = ~ Low) %>%
    layout(title = "Candlestick Chart - Tesla Stock Price")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
df_Tesla_stock_data %>%
    filter(Date >= "2020-01-01") %>%
    plot_ly(x     = ~ Date,
            type  = "candlestick",
            open  = ~ Open,
            close = ~ Close,
            high  = ~ High,
            low   = ~ Low) %>%
    layout(title = "Candlestick Chart - Tesla Stock Price")

OHLC

# TODO: Add nicer colors

p_LC <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Adjusted)) +
    geom_line(size = 1) +
    geom_line(aes(y = Low),
              col      = palette()[1],
              linetype = "dashed") +
    geom_line(aes(y = High),
              col      = palette()[8],
              linetype = "dashed") +
    geom_ribbon(aes(ymin  = Low,
                    ymax  = High),
                alpha = 0.4) +
    labs(title = "Tesla Stock Price - Adjusted Close with Daily High-Low Range") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar)

p_LC %>%
    ggplotly()

13 Animated Plots

A simple animated scatter plot with plotly and the gapminder data:

gapminder %>%
    filter(country %in% c("China", "United States", "United Kingdom", "India",
                          "Germany", "Switzerland", "Austria", "Japan", "Singapore")) %>%
    plot_ly(x      = ~ lifeExp,
            y      = ~ gdpPercap,
            size   = ~ pop,
            color  = ~ country,
            frame  = ~ year,
            type   = "scatter",
            mode   = "markers",
            colors = palette())